-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add few missing SparkLikeExpr
methods
#1721
feat: add few missing SparkLikeExpr
methods
#1721
Conversation
Hi @FBruzzesi, I've added some missing methods to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serious great effort @Dhanunjaya-Elluri! Thanks a ton for this, I am very excited π
I was in a bit of a rush but manage to leave some feedback.
I think the main critical point for many methods is null's and/or nan's cases where relevant. The best way to catch different behaviour of methods is to test against data containing null's and/or nan's, by adapting the existing tests with the pyspark_constructor
.
def clip( | ||
self, | ||
lower_bound: Any | None = None, | ||
upper_bound: Any | None = None, | ||
) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recently introduced support for lower_bound
and upper_bound
to be other Expr's.
I am ok to keep it as a follow up, but definitly something we would look forward to.
SparkLikeExpr
methods
ca44572
to
7c9d37d
Compare
Hi @FBruzzesi, made some changes now. Just one question as requested in the review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much from me too ππΌ
I just had a brief look from mobile for now, it's great work!
@@ -78,6 +78,29 @@ where `YOUR-GITHUB-USERNAME` will be your GitHub user name. | |||
|
|||
Here's how you can set up your local development environment to contribute. | |||
|
|||
#### Prerequisites for PySpark tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should narwhals suggest (and maintain :)) a guide to install Java for pyspark?
Or should we just add a note to say that pyspark needs Java and add a link to the pyspark documentation?
There may be different ways one wants to install Java on their machine.
For example, on macOS I prefer using openjdk installed via homebrew.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be simple to just say that pyspark needs java installed and add a link to pyspark documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adjusting @Dhanunjaya-Elluri ! We are getting closer and closer π
I just have a couple of comments that need to be addressed
tests/spark_like_test.py
Outdated
data = { | ||
"a": [1.0, None, None, 3.0], | ||
"b": [1.0, None, 4, 5.0], | ||
} | ||
df = nw.from_native(pyspark_constructor(data)) | ||
result = df.select( | ||
a=nw.col("a").n_unique(), | ||
b=nw.col("b").n_unique(), | ||
) | ||
expected = {"a": [2], "b": [3]} | ||
assert_equal_data(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None and nan should also count toward n_unique
: see polars.Expr.n_unique
|
||
|
||
# copied from tests/expr_and_series/is_duplicated_test.py | ||
def test_is_duplicated(pyspark_constructor: Constructor) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test this one and unique
with None's as well?
The expected behaviour is:
data = {"a": [1, 1, 2, None], "b": [1, 2, None, None], "level_0": [0, 1, 2, 3]}
pl.DataFrame(data).select(pl.col("a", "b").is_duplicated())
shape: (4, 2)
βββββββββ¬ββββββββ
β a β b β
β --- β --- β
β bool β bool β
βββββββββͺββββββββ‘
β true β false β
β true β false β
β false β true β
β false β true β
βββββββββ΄ββββββββ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
β¦into feat/spark-expr-methods
2aa436d
to
fbdd61c
Compare
Hey @FBruzzesi, done changes now. Let me know if there needs to be any changes |
Hey @Dhanunjaya-Elluri I just took another look at this. For def is_duplicated(self) -> Self:
def _is_duplicated(_input: Column) -> Column:
from pyspark.sql import Window
from pyspark.sql import functions as F # noqa: N812
# Create a window spec that treats each value separately.
return F.count("*").over(Window.partitionBy(_input)) > 1
return self._from_call(
_is_duplicated, "is_duplicated", returns_scalar=self._returns_scalar
) Similarly for However, |
β¦into feat/spark-expr-methods
Hi @FBruzzesi, thanks for taking deeper look. This looks much simpler π. For now I'll push these changes then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one! Amazing work @Dhanunjaya-Elluri ππΌ
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below
Added the following missing methods in
SparkLikeExpr
:abs
,median
,clip
,is_between
,is_duplicated
,is_finite
,is_in
,is_nan
,is_unique
,len
,n_unique
,round
,skew